feat: ConvoMem sampled adapter with range-probe selective fetch by groksrc · Pull Request #21 · basicmachines-co/basic-memory-benchmarks

groksrc · 2026-06-12T19:46:51Z

Summary

Adds the last Phase-1 benchmark: ConvoMem (Salesforce, Apache-2.0, ~75K QA pairs), as a documented stratified sample. Pre-mixed test cases map 1:1 onto grouped runner mode.

Design

Selective fetch: the dataset is multi-GB (one size-300 batch is ~850MB). Batch files are ordered by context size within each category dir, so an HTTP Range tail-probe (last 4KB) reads every file's final contextSize without downloading the file. Only files matching the requested sizes are fetched. index.json records all probe results, including files that were not downloaded, so the selection is auditable. Probes are throttled and retried because the HF CDN resets rapid bursts of requests (hit and fixed during the live run).
Documented sampling: stratified by (category, contextSize), fixed seed, sampling.json records seed + per-stratum population/sample counts. A published number states exactly which slice it covers.
Anti-leakage: containsEvidence/model_name scrubbed, conversation ids remapped to neutral positional ids (covered by test).
Ground truth maps evidence conversation ids through the remap; abstention evidence referencing absent conversations yields empty ground truth by design.
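The tail-probe described under Selective fetch can be sketched roughly as follows. This is a minimal illustration, not the code in datasets/convomem.py: the function names, the regex, and the assumption that contextSize is serialized as a plain JSON number are all hypothetical, and throttling/retry is left out. Selecting files by equality on the final contextSize is also an assumption about how the match is done.

```python
from __future__ import annotations

import re
import urllib.request

TAIL_BYTES = 4096  # the PR probes the last 4KB of each batch file

# Assumption: each case stores its size as a JSON number under "contextSize".
_CONTEXT_SIZE = re.compile(rb'"contextSize"\s*:\s*(\d+)')


def last_context_size(tail: bytes) -> int | None:
    """Return the final contextSize value found in a file tail, or None."""
    matches = _CONTEXT_SIZE.findall(tail)
    return int(matches[-1]) if matches else None


def probe_tail(url: str, timeout: float = 30.0) -> bytes:
    """Fetch only the last TAIL_BYTES of a remote file via a suffix byte range."""
    request = urllib.request.Request(url, headers={"Range": f"bytes=-{TAIL_BYTES}"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # 206 Partial Content means the server honored the range; a 200 would
        # mean the whole body is being sent, which defeats the probe.
        if response.status != 206:
            raise RuntimeError(f"range not honored: HTTP {response.status}")
        return response.read()


def select_files(probed: dict[str, int | None], wanted: set[int]) -> list[str]:
    """Keep only files whose probed final contextSize is a requested size."""
    return sorted(name for name, size in probed.items() if size in wanted)
```

Keeping the parsing and selection steps separate from the network call means both can be unit-tested offline, and the full `probed` mapping is exactly what an index of probe results would need to record.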

Verification

7 new tests: ground-truth mapping, leakage scrub, seed determinism (same seed → identical sample; different seed → different), stratification manifest, context-size filter.
Live run against the real dataset (user_evidence @ size 10): 50 files probed, exactly 2 downloaded, 3 cases → 30 docs/30 queries (size-10 cases pack 10 questions per shared haystack — one ingest serves 10 queries), all ground truth non-empty.
Full suite green (89 tests), lint clean.
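For readers who want a concrete picture of the seed-determinism and stratification-manifest checks, here is a hedged sketch of stratified sampling by (category, contextSize). It is not the converter's actual code: `stratified_sample`, the `category` field on each case, and the manifest layout are assumptions made for illustration.

```python
from __future__ import annotations

import random
from collections import defaultdict


def stratified_sample(
    cases: list[dict], per_stratum: int, seed: int
) -> tuple[list[dict], dict]:
    """Sample up to per_stratum cases from each (category, contextSize) stratum."""
    strata: dict[tuple[str, int], list[dict]] = defaultdict(list)
    for case in cases:
        strata[(case["category"], case["contextSize"])].append(case)

    rng = random.Random(seed)  # one seeded generator keeps the run reproducible
    sample: list[dict] = []
    manifest: dict = {"seed": seed, "strata": {}}
    # Iterate strata in sorted order so the draw sequence does not depend on
    # the order in which strata were first encountered.
    for key in sorted(strata):
        population = strata[key]
        k = min(per_stratum, len(population))
        sample.extend(rng.sample(population, k))
        manifest["strata"][f"{key[0]}/{key[1]}"] = {
            "population": len(population),
            "sample": k,
        }
    return sample, manifest
```

Calling this twice with the same seed and the same input order yields the same sample, and the returned manifest carries the seed plus per-stratum population and sample counts, which is the information the PR says sampling.json records.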

🤖 Generated with Claude Code

ConvoMem (Salesforce, Apache-2.0, ~75K QA pairs) ships as pre-mixed test cases (self-contained conversation haystacks plus questions) that map 1:1 onto the grouped runner mode.

- datasets/convomem.py: the full dataset is multi-GB (one size-300 batch is ~850MB), so fetching is selective. Batch files within each <category>/<N>_evidence/ dir are ordered by case context size, and an HTTP Range tail-probe (last 4KB) reads each file's final contextSize without downloading it. Only files matching the requested sizes are fetched; index.json records every probe result including files NOT downloaded, so the selection itself is auditable. Probes are throttled with retries (HF CDN resets rapid bursts).
- converters/convomem_to_corpus.py: stratified deterministic sampling by (category, contextSize) with a fixed seed; sampling.json records seed, per-stratum population, and sample counts so a published number states exactly which slice of ConvoMem it covers. Leakage scrub: containsEvidence/model_name dropped, conversation ids remapped to neutral positional ids. Ground truth maps evidence conversation ids through the remap; abstention evidence referencing absent conversations yields empty ground truth by design.
- CLI: datasets fetch --dataset convomem --context-sizes; convert convomem --sample-per-stratum/--seed/--context-sizes. justfile recipes + README section.

Live-verified against the real dataset (user_evidence, size 10): 50 files probed, exactly 2 downloaded, 3 cases sampled -> 30 docs / 30 queries (size-10 cases pack 10 questions per haystack, each targeting a distinct evidence conversation), all ground truth non-empty and remapped.

7 new unit tests; suite green.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Signed-off-by: Drew Cain <groksrc@gmail.com>
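The leakage scrub and ground-truth remap from the commit message can be illustrated with a small sketch. Only the field names containsEvidence and model_name come from the PR; the recursive scrub, the `id` key on conversations, and the `conv-000` id format are assumptions chosen for the example, not the converter's actual behavior.

```python
from __future__ import annotations

# Fields the PR says are dropped so answers cannot leak into the corpus.
LEAKY_KEYS = {"containsEvidence", "model_name"}


def scrub(obj):
    """Recursively drop leaky keys from nested dicts and lists."""
    if isinstance(obj, dict):
        return {k: scrub(v) for k, v in obj.items() if k not in LEAKY_KEYS}
    if isinstance(obj, list):
        return [scrub(v) for v in obj]
    return obj


def remap_ids(conversations: list[dict]) -> dict[str, str]:
    """Map each original conversation id to a neutral positional id."""
    # Hypothetical id format: position in the haystack, zero-padded.
    return {conv["id"]: f"conv-{i:03d}" for i, conv in enumerate(conversations)}


def ground_truth(evidence_ids: list[str], id_map: dict[str, str]) -> list[str]:
    """Translate evidence ids through the remap; absent conversations drop out."""
    return [id_map[e] for e in evidence_ids if e in id_map]
```

Under these assumptions, an abstention case whose evidence points at a conversation that is not in the haystack maps to an empty list, which matches the "empty ground truth by design" behavior described above.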

groksrc merged commit caff4fc into main Jun 12, 2026
1 check passed

groksrc merged 1 commit into main from feat/convomem-sampled
